Imbalance in Balance: Online Concept Balancing in Generation Models

Shi, Yukai, Ou, Jiarong, Chen, Rui, Yang, Haotian, Wang, Jiahao, Tao, Xin, Wan, Pengfei, Zhang, Di, Gai, Kun

arXiv.org Artificial Intelligence

In visual generation tasks, responses to complex concepts and their combinations often lack stability and are error-prone, which remains an under-explored problem. In this paper, we explore the causal factors behind poor concept responses through carefully designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes.
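The abstract gives no formula, but a "concept-wise equalization" of a diffusion loss can be sketched as inverse-frequency reweighting of the per-sample noise-prediction loss, with frequencies counted online. The function name, the weighting scheme, and the mean-1 normalization below are all assumptions for illustration, not the paper's actual IMBA loss.

```python
import torch

def concept_balanced_diffusion_loss(pred_noise, true_noise, concept_ids, concept_freq):
    """Hypothetical sketch of a concept-wise equalization loss.

    Per-sample MSE diffusion loss, reweighted inversely to how often each
    sample's concept has been observed so far (online frequency counts),
    so rare concepts are not drowned out by frequent ones.
    """
    # Plain per-sample noise-prediction (diffusion) loss over C, H, W.
    per_sample = ((pred_noise - true_noise) ** 2).mean(dim=(1, 2, 3))
    # Inverse-frequency weights, normalized to mean 1 over the batch so the
    # overall loss scale stays comparable to the unweighted loss.
    freqs = concept_freq[concept_ids].float()
    weights = 1.0 / freqs.clamp(min=1.0)
    weights = weights / weights.mean()
    return (weights * per_sample).mean()
```

Because the weights come from running counts rather than a precomputed dataset histogram, this kind of reweighting can be applied online, which matches the abstract's claim of needing no offline dataset processing.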


REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Leng, Xingjian, Singh, Jaskirat, Hou, Yunzhong, Xing, Zhenchang, Xie, Saining, Zheng, Liang

arXiv.org Artificial Intelligence

In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, training both the VAE and the diffusion model end-to-end with the standard diffusion loss is observed to be ineffective, even degrading final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x over the REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state of the art, achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256. Code is available at https://end2end-diffusion.github.io.
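The abstract describes two loss terms: the standard diffusion loss and a representation-alignment (REPA-style) loss between the diffusion model's internal features and those of a pretrained encoder. A minimal sketch of how such a combined objective could look is below; the cosine-based alignment, the weighting, and the gradient routing between the two networks are assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def repa_alignment_loss(model_feats, target_feats):
    """Cosine-similarity alignment between diffusion-model features and
    features from a (typically frozen) pretrained visual encoder."""
    m = F.normalize(model_feats, dim=-1)
    t = F.normalize(target_feats, dim=-1)
    return 1.0 - (m * t).sum(dim=-1).mean()

def repa_e_loss(noise_pred, noise, model_feats, target_feats, align_weight=0.5):
    """Hypothetical combined objective: the standard diffusion (noise-MSE)
    loss plus the alignment term that, per the abstract, is what makes
    end-to-end VAE tuning work. align_weight is an illustrative value."""
    diffusion = F.mse_loss(noise_pred, noise)
    return diffusion + align_weight * repa_alignment_loss(model_feats, target_feats)
```

In an end-to-end setup, gradients from this combined loss would reach both the diffusion transformer and the VAE encoder; exactly which term backpropagates into which network is a design choice the abstract does not specify.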



Autoregressive Image Generation without Vector Quantization

Li, Tianhong

Neural Information Processing Systems

Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using a categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications. Code is available at https://github.com/LTH14/mar.
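The core idea, a per-token Diffusion Loss replacing the softmax over a codebook, can be illustrated with a toy module: a small MLP predicts the noise added to a continuous token, conditioned on the autoregressive model's hidden state. This is a simplified sketch; the linear noising path and the network shape below are assumptions, not MAR's actual denoising MLP or noise schedule.

```python
import torch
import torch.nn as nn

class TokenDiffusionLoss(nn.Module):
    """Toy per-token Diffusion Loss: instead of a categorical distribution
    over a codebook, a small MLP predicts the noise added to a continuous
    token x, conditioned on the AR model's hidden state z."""

    def __init__(self, token_dim, cond_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden),  # +1 for the time input
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x, z):
        # Sample a diffusion time and noise the continuous token x.
        t = torch.rand(x.shape[0], 1)
        eps = torch.randn_like(x)
        # Simple linear interpolation path toward noise; an illustrative
        # assumption, not the DDPM-style schedule used in the paper.
        x_t = (1 - t) * x + t * eps
        eps_hat = self.net(torch.cat([x_t, z, t], dim=-1))
        return ((eps_hat - eps) ** 2).mean()
```

At inference, sampling a token would mean running this small denoiser for a few steps conditioned on z, which is what removes the need for a discrete tokenizer.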


Variational Diffusion Models

Neural Information Processing Systems

Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints.
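The invariance claim can be made concrete in one equation. Following the paper's notation (with $\mathrm{SNR}(t) = \alpha_t^2/\sigma_t^2$), a change of variables $v = \mathrm{SNR}(t)$ turns the continuous-time VLB's diffusion term into an integral over signal-to-noise values; the sketch below is a reconstruction of that result, not a quote of the paper's exact equation.

```latex
% With v = SNR(t), the continuous-time diffusion loss becomes
\mathcal{L}_\infty(\mathbf{x})
  = \tfrac{1}{2}\,\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}
    \int_{\mathrm{SNR}_{\min}}^{\mathrm{SNR}_{\max}}
    \bigl\lVert \mathbf{x} - \hat{\mathbf{x}}_\theta(\mathbf{z}_v, v) \bigr\rVert_2^2 \, dv ,
% where z_v is the noisy latent at signal-to-noise level v.
```

The integrand no longer references the schedule itself, so any two noise schedules with the same endpoint values $\mathrm{SNR}_{\min}$ and $\mathrm{SNR}_{\max}$ yield the same continuous-time VLB, which is exactly the invariance the abstract states.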


A Distribution details: A.1 q(z_t | z_s) — the distribution of z

Neural Information Processing Systems

These are three equally valid views of the same model class, which have been used interchangeably in the literature. The parameterization of our model is discussed in Appendix B. In this section we provide details on the exact setup for each of our experiments. In Section B.1 we describe the choices common to all of our experiments. Our models are deeper than those used by Ho et al. [2020]; specific numbers are given in the appendix. Apart from the middle attention block that connects the downward and upward branches of the U-Net, we remove all other attention blocks from the model.